Stroke Prediction

Lab 1 report submission for CS7324_SP21

Team members: Jinyu Du, Hessam Emami

Business overview

According to the U.S. Centers for Disease Control and Prevention (CDC), a stroke can happen at any age. "Every year, more than 795,000 people in the United States have a stroke." The good news is that people can protect themselves by understanding and controlling their stroke risk factors [1]. The Mayo Clinic defines stroke this way: "A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes." It points out that potentially treatable stroke risk factors include obesity, lack of physical activity, high blood pressure, cigarette smoking, high cholesterol, diabetes, and cardiovascular disease [2]. It is therefore of great interest to analyze the available stroke prediction dataset and determine the role each attribute plays in predicting stroke risk.

The analysis for Lab 1 is intended to let the data speak for itself. Its findings and conclusions are expected to bring the following benefits: 1) We will gain first-hand statistics and visualizations of stroke risk factors. 2) The findings are expected to provide evidence to support stroke risk prediction for physicians and healthcare professionals, who work on the front line of patient care and need accurate information to prevent strokes and minimize misdiagnosis. 3) The discoveries can also benefit fitness professionals. For example, fitness trainers can make recommendations and create training programs tailored to clients to help lower their risk of stroke. 4) The conclusions and prediction models could be incorporated into wearable devices such as a Fitbit or an Apple Watch to help monitor people's stroke risk factors and alert people when the risk is high.

Measure of success

Stroke misdiagnosis is a major healthcare concern, with initial misdiagnosis estimated to occur in 9% of all stroke patient cases in the emergency setting [4]. Each year, about 1.2 million people in the US may have a stroke or be at high risk of an impending stroke. If the misdiagnosis rate can be reduced by only 1 percentage point, approximately 12,000 of these patients (1% of 1.2 million) could be correctly diagnosed each year, potentially saving their lives.

The high probability of incorrect diagnoses motivates us to minimize both false positive and false negative stroke predictions in the model. Because the model will be used to predict rather than to diagnose, false positives can be tolerated much more readily than false negatives. In other words, if a patient may be at risk of a stroke, we prefer to alert the physician and the patient so that further examination can follow.
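The preference for tolerating false positives over false negatives can be expressed with standard classification metrics. The snippet below is a minimal sketch, not the evaluation used in this report: the labels in `y_true` and `y_pred` are made-up illustrative values (1 = stroke, 0 = no stroke), and the choice of recall and an F-beta score with beta = 2 (which weights recall more heavily than precision) is an assumption about how such a preference might be measured.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means stroke."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # stroke correctly flagged
        elif truth == 0 and pred == 1:
            fp += 1  # false alarm: tolerable, leads to further examination
        elif truth == 1 and pred == 0:
            fn += 1  # missed stroke: the costly error we want to avoid
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(tp, fn):
    """Fraction of actual stroke cases that the model flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of flagged cases that are actual stroke cases."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_beta(prec, rec, beta=2.0):
    """F-beta score; beta > 1 weights recall more than precision."""
    b2 = beta * beta
    denom = b2 * prec + rec
    return (1 + b2) * prec * rec / denom if denom else 0.0


# Hypothetical labels for illustration only (not taken from the dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
rec = recall(tp, fn)
prec = precision(tp, fp)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"recall={rec:.3f} precision={prec:.3f} F2={f_beta(prec, rec):.3f}")
```

Under this framing, a model with higher recall would be preferred even at some cost in precision, since each false negative corresponds to a patient at risk who receives no alert.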

An overview of the dataset

The Stroke Prediction dataset, downloaded from Kaggle, was chosen for Lab 1. It contains patient information that may help predict whether a patient is at risk of stroke.

The dataset can be accessed here:

Stroke Prediction Dataset

Dataset Author: fedesoriano

The list below describes the attributes in the dataset [3]: